{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#LASSO正则化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LASSO（ least absolute shrinkage and selection operator，最小绝对值收缩和选择算子）方法与岭回归和LARS（least angle regression，最小角回归）很类似。与岭回归类似，它也是通过增加惩罚函数来判断、消除特征间的共线性。与LARS相似的是它也可以用作参数选择，通常得出一个相关系数的稀疏向量。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "岭回归也不是万能药。有时就需要用LASSO回归来建模。本主题将用不同的损失函数，因此就要用对应的效果评估方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，我们还是用`make_regression`函数来建立数据集："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_regression\n",
    "reg_data, reg_target = make_regression(n_samples=200, n_features=500, n_informative=5, noise=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "之后，我们导入`lasso`对象："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "lasso = Lasso()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`lasso`包含很多参数，但是最意思的参数是`alpha`，用来调整` lasso`的惩罚项，在**How it works...**会具体介绍。现在我们用默认值`1`。另外，和岭回归类似，如果设置为`0`，那么`lasso`就是线性回归："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,\n",
       "   normalize=False, positive=False, precompute=False, random_state=None,\n",
       "   selection='cyclic', tol=0.0001, warm_start=False)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lasso.fit(reg_data, reg_target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "再让我们看看还有多少相关系数非零："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "np.sum(lasso.coef_ != 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "d:\\programfiles\\Miniconda3\\lib\\site-packages\\IPython\\kernel\\__main__.py:2: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator\n",
      "  from IPython.kernel.zmq import kernelapp as app\n",
      "d:\\programfiles\\Miniconda3\\lib\\site-packages\\sklearn\\linear_model\\coordinate_descent.py:432: UserWarning: Coordinate descent with alpha=0 may lead to unexpected results and is discouraged.\n",
      "  positive)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "500"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lasso_0 = Lasso(0)\n",
    "lasso_0.fit(reg_data, reg_target)\n",
    "np.sum(lasso_0.coef_ != 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "和我们设想的一样，如果用线性回归，没有一个相关系数变成0。而且，如果你这么运行代码，scikit-learn会给出建议，就像上面显示的那样。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对线性回归来说，我们是最小化残差平方和。而LASSO回归里，我们还是最小化残差平方和，但是加了一个惩罚项会导致稀疏。如下所示：\n",
    "\n",
    "$$\\sum {e_i + \\lambda \\\n",
    "{\\begin{Vmatrix}\n",
    "\\beta \n",
    "\\end{Vmatrix}}_1}\n",
    "$$\n",
    "\n",
    "最小化残差平方和的另一种表达方式是：\n",
    "\n",
    "$$ RSS(\\beta),其中\n",
    "{\\begin{Vmatrix}\n",
    "\\beta \n",
    "\\end{Vmatrix}}_1\n",
    "\\lt \\beta\n",
    "$$\n",
    "\n",
    "这个约束会让数据稀疏。LASSO回归的约束创建了围绕原点的超立方体（相关系数是轴），也就意味着大多数点都在各个顶点上，那里相关系数为0。而岭回归创建的是超平面，因为其约束是L2范数，少一个约束，但是即使有限制相关系数也不会变成0。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###LASSO交叉检验"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的公式中，选择适当的$\\lambda$（在scikit-learn的`Lasso`里面是`alpha`，但是书上都是$\\lambda$）参数是关键。我们可以自己设置，也可以通过交叉检验来获取最优参数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,\n",
       "    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,\n",
       "    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,\n",
       "    verbose=False)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LassoCV\n",
    "lassocv = LassoCV()\n",
    "lassocv.fit(reg_data, reg_target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`lassocv`有一个属性就是确定最合适的$\\lambda$："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.58535963603062136"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lassocv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算的相关系数也可以看到："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.       , -0.       ,  0.       ,  0.0192606, -0.       ])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lassocv.coef_[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用最近的参数拟合后，`lassocv`的非零相关系数有29个："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "29"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.sum(lassocv.coef_ != 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###LASSO特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LASSO通常用来为其他方法所特征选择。例如，你可能会用LASSO回归获取适当的特征变量，然后在其他算法中使用。\n",
    "\n",
    "要获取想要的特征，需要创建一个非零相关系数的列向量，然后再其他算法拟合："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(200, 29)"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mask = lassocv.coef_ != 0\n",
    "new_reg_data = reg_data[:, mask]\n",
    "new_reg_data.shape"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
